ABSTRACT

A Parallel Hierarchical algorithm for Global Routing (PHIGURE) is presented in this paper. The 
router is based on the work of Burstein and Pelavin [1], but has many extensions for general global routing 
and parallel execution. Main features of the algorithm include structured hierarchical decomposition into 
separate independent tasks which are suitable for parallel execution and adaptive simplex solution for 
adding feedthroughs and adjusting channel heights for row-based layout. In this paper we will be examin- 
ing closely alternative decomposition methods and the various levels of parallelism available in the algo- 
rithm. The algorithm is described and results are presented for a shared-memory multiprocessor imple- 
mentation. 


I. INTRODUCTION 

The computational requirements for high-quality synthesis and analysis of VLSI designs far outpace 
the rapidly growing complexity of VLSI designs. One approach to handling the complexity problem has 
been to apply parallel processing to certain Computer-Aided Design (CAD) applications [2], because 
parallel processing can solve larger problem sizes and achieve high-quality results, and because 
low-cost multiprocessors are affordable. Along with global routing, research in parallel processing for CAD has 
included the tasks of floor planning [3], cell placement [4, 5, 6, 7, 8], circuit extraction [9], and test 
generation/fault simulation [10]. This research has demonstrated the wide variety of CAD applications that 
can be solved with parallel processing. 

In this paper, we present a new parallel algorithm for global routing called PHIGURE, a Parallel 
Hierarchical Global Router. The task of global routing is to take a netlist, a list of pin positions, and a 
description of the available routing resources and determine the connections and macro paths for each 
net. Figure 1 shows a simple global routing problem for a chip with pads (P) and standard cells (C) in rows. 
A global router must make choices between alternative paths for a net. Some criteria used to evaluate the 
quality of the routing include: total net length, total chip area, the number of tracks required (row-based 
routing), the number of feedthroughs used, and the number of vias required. For row-based layout, the 
output of the global router is used to set up the channels to be routed by a channel router. 

Previous research in uniprocessor global routing falls into four broad categories: 
minimum spanning tree solutions [11, 12], maze routing [13], physical analogies [14, 15, 16], and hierarchical 
routing [1, 17, 18]. Minimum spanning tree solutions model net connections as a spanning graph and 
try to reduce the graph to a tree while minimizing a cost function. In order to be effective, however, these 
solutions must handle the net ordering problem. Maze routing solutions apply a line/wave expansion algo- 
rithm to route one net at a time. Again, the net ordering problem affects the quality of the results. Physical 
analogy approaches have cast the routing problem in terms of concepts such as simulated 
annealing, attractive and repulsive forces, and electromagnetics. Top-down and bottom-up hierarchical 
approaches have also been studied. Other research work on routing has concentrated on combinations of 
these approaches along with the net rip-up and reroute technique. 

In the past, several researchers have proposed parallel approaches to the global routing problem. 
One approach was to develop a maze routing algorithm suitable for a special purpose hardware routing 
machine, made up of a 2-D array of microprocessors [19]. Similarly, a maze router was implemented on 
the AAP-1 2-D array processor [20]. Two other algorithms for maze routing have been developed 
specifically for the hypercube [21, 22]. A different approach, developed by Rose for shared-memory 
multiprocessors [23], determines the best of possible two-bend routes for each two-pin subnet of each net. 

Along with the problem of net order dependence, these parallel routing approaches also suffer from 
routing quality degradation. Thus, since hierarchical routing methods not only route all nets at the same 
time and incur no routing degradation with parallelism, but also are useful in handling large and complex 
routing problems, we have developed a parallel top-down hierarchical router. 

In Section II of this paper we discuss our global routing model and the hierarchical routing technique. 
In Section III we provide an overview of the decomposition strategies applied to subdivide and 
solve the problem, along with other issues in the algorithm design. Implementation details and parallel 
processor results for the algorithm are given in Section IV. 

II. THE GLOBAL ROUTING FORMULATION 


2.1. Global Routing Model 

The global routing model we are using is similar to that of Burstein and Pelavin [1]. The entire layout 
area (including pads) is divided into a two-dimensional array of routing cells. Each routing cell is assigned 
routing capacity information for each of its four boundaries, based on the physical dimensions of the rout- 
ing cell. This provides constraints on the number of nets that can be routed through the edges of the rout- 
ing cell, as in Figure 2. 

At each level of the hierarchical decomposition, the current set of routing cells is divided into four 
regions, forming a two-by-two array of supercells. Thus, each supercell will encompass a sub-region of 
the layout area. These supercells are further divided at later steps of the decomposition. Each net is cast 
into one of 15 net types, based on the presence of pins in each of the four supercells. The net types con- 
sisting of two or more pins are shown in Figure 3, along with the possible routings for each. Such a formu- 
lation was proposed by Burstein and Pelavin [1]. 
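
The net typing step can be sketched as follows. This is a minimal illustration, not PHIGURE's own code: the function names, the (row, column) pin coordinates, and the numbering of the supercells are all assumptions made for the example. A net's type is taken to be the nonempty set of supercells that contain at least one of its pins, which gives 2^4 - 1 = 15 types, 11 of which span two or more supercells.

```python
def supercell_index(row, col, axis_row, axis_col):
    """Return 0..3 for the supercell containing a routing cell.

    Hypothetical convention: bit 0 is set when the cell lies to the right
    of the vertical axis, bit 1 when it lies above the horizontal axis.
    """
    right = 1 if col > axis_col else 0
    upper = 1 if row > axis_row else 0
    return right | (upper << 1)


def net_type(pins, axis_row, axis_col):
    """Encode the set of occupied supercells as a 4-bit mask in 1..15."""
    mask = 0
    for row, col in pins:
        mask |= 1 << supercell_index(row, col, axis_row, axis_col)
    return mask


def needs_routing(mask):
    """A net is routed at this level only if it spans two or more supercells."""
    return bin(mask).count("1") >= 2
```

With this encoding, a net whose pins all fall in one supercell is simply passed down to the next level of the hierarchy.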

A linear (integer) programming formulation of the problem (LP) is defined as 

\[ \max_{x} \; p x \quad \text{subject to} \quad A x \ge a \ \text{and} \ B x = b, \]

where x represents the variable space, p represents the objective function, A and a represent the 
inequality constraints, and B and b represent any equality constraints. In our problem, the set of variables 
$x_i$, $0 \le i \le 27$, represents the 28 possible net routings, and the set of 15 constraints is based on the available 
routing capacities and the types of nets being routed. The cost function is designed to minimize 
interconnection lengths of the nets. The resulting value of each variable $x_i$ represents the number of nets routed in 
the particular pattern which the variable represents. After a solution to the LP is found, the nets are 
assigned to the appropriate configuration. 
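
The final assignment step can be illustrated with a short sketch. It assumes the LP solution has already been converted into, for each net type, a list of (pattern, count) pairs; the function name, the data layout, and the pattern labels are hypothetical and only serve to show one simple linear-time way of handing out patterns to nets.

```python
def assign_patterns(nets_by_type, pattern_counts):
    """Give each net a routing pattern according to the LP solution.

    nets_by_type:   {net_type: [net ids]}
    pattern_counts: {net_type: [(pattern label, number of nets), ...]}
    Each net is visited once, so the work is linear in the number of nets.
    """
    assignment = {}
    for net_type, nets in nets_by_type.items():
        remaining = iter(nets)
        for pattern, count in pattern_counts.get(net_type, []):
            for _ in range(count):
                net = next(remaining, None)
                if net is None:          # fewer nets than the counts allow
                    break
                assignment[net] = pattern
    return assignment
```

For example, three nets of one type with counts of 2 and 1 for two patterns would receive the first pattern twice and the second pattern once, in net order.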

2.2. Estimating Routing Capacities 

In the routing capacity model, it is sufficient for each routing cell to maintain capacity information for 
only two of its four shared edges (for example the top and right edges). Denote the vertical capacity for a 
routing cell in row r and column c as $v_{r,c}$ (across the top edge), and the horizontal capacity as $h_{r,c}$ (across 
the right edge). Let L, R, T, and B be the left, right, top, and bottom edges (rows and columns) of the 
region to be solved. Let X and Y be the locations of the vertical (Y) and horizontal (X) axes respectively 
of the two-by-two supercell array. Let $CAP_i$, $i \in \{A, B, C, D\}$, represent the capacities of the four axis 
segments in clockwise order around the two-by-two supercell array, as shown in Figure 4(a). Then 

\[ CAP_A = \sum_{i} \min(h_{i,X-1},\, h_{i,X},\, h_{i,X+1}) \]

\[ CAP_B = \sum_{j} \min(v_{Y-1,j},\, v_{Y,j},\, v_{Y+1,j}) \]

\[ CAP_C = \sum_{i} \min(h_{i,X-1},\, h_{i,X},\, h_{i,X+1}) \]

\[ CAP_D = \sum_{j} \min(v_{Y-1,j},\, v_{Y,j},\, v_{Y+1,j}), \]

where each sum runs over the rows i or columns j spanned by the corresponding axis segment. 

This scheme quickly estimates the capacity of the axes with little chance of overestimating by concentrat- 
ing on the regions closest to the axis. Cases in which the cell capacities are nonuniform near an axis are 
handled as well. Figure 4(b) illustrates the capacity estimation for the example in Figure 2. 
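
A minimal sketch of this estimate is given below. It assumes that the capacity of an axis segment is the sum, over the cells along the segment, of the minimum capacity found at the axis and at its two neighboring positions. The function names, the list-of-lists layout, and the clamping at the grid border are assumptions made for the example.

```python
def vertical_axis_capacity(h, x, rows):
    """Capacity of a vertical-axis segment at column x over the given rows.

    h[r][c] is the horizontal capacity across the right edge of cell (r, c).
    Taking the minimum over columns x-1, x, x+1 keeps the estimate
    conservative when capacities near the axis are nonuniform.
    """
    total = 0
    for r in rows:
        neighbors = [h[r][c] for c in (x - 1, x, x + 1) if 0 <= c < len(h[r])]
        total += min(neighbors)
    return total


def horizontal_axis_capacity(v, y, cols):
    """Capacity of a horizontal-axis segment at row y over the given columns.

    v[r][c] is the vertical capacity across the top edge of cell (r, c).
    """
    total = 0
    for c in cols:
        neighbors = [v[r][c] for r in (y - 1, y, y + 1) if 0 <= r < len(v)]
        total += min(neighbors)
    return total
```

Calling `vertical_axis_capacity` once for the rows above the horizontal axis and once for the rows below it would give the two vertical-axis segment capacities; the horizontal-axis segments are handled the same way by column range.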

2.3. Feedthrough Insertion and Channel Width Expansion 

In row-based layout, feedthroughs must be inserted into the rows when a connection must be made from 
row i to row i+2 and no built-in feedthroughs or equivalent pins are available. 
PHIGURE handles the problem through the simplex computations. After the problem has been set up, if 
sufficient routing facilities are available, a solution will be found, else the simplex algorithm will terminate 
with an infeasible initial problem. By analyzing the simplex state and the given routing problem, adjust- 
ments to certain capacities will provide a feasible initial problem for the simplex algorithm. Adjustments to 
$CAP_A$ and $CAP_C$ are equivalent to increasing the channel width. Adjustments to $CAP_B$ and $CAP_D$ are 
equivalent to inserting feedthroughs in the row along the X axis. This technique has proven very effective 
in PHIGURE. 

2.4. Hierarchical Decomposition 

As mentioned earlier, we are applying two-dimensional hierarchical decomposition methods to the 
global routing problem. At each stage of the hierarchy we divide a larger problem into four smaller sub- 
problems (divide and conquer). Deciding how to partition the subproblems so that they are independent of 
each other is very important. One primary decision has to do with how net-crossing locations along the 
boundaries between the subproblems are determined and locked in place. We have investigated two 
approaches which are discussed in the following sections. 

2.4.1. Maximal Boundary Determination 

The first strategy completely determines the net crossing locations by recursively decomposing 
along the axes of interest down to the routing cell level. This strategy is computationally more costly than 
the one to be discussed in the next section, but the advantage is that the boundary interface is determined 
hierarchically as well. Figure 5 shows the first steps in the decomposition for this strategy. The nodes of 
the graph represent a complete solution of a two-by-two routing instance. The arcs of the graph represent 
dependencies from child nodes (below) to their parent node (above). In Step 1 and Step 2, the top-most 
two-by-two solution is followed first by the recursive subdivision and solution of the X axis, down to the 
level of individual routing cells, and second by the recursive subdivision and solution of the Y axis. After 
completing these steps, the net crossings have been completely determined and locked into place along 
both axes of the two-by-two supercell problem, and the four sub-problems for Step 3 are completely 
independent of each other. This sequence of steps is then recursively repeated until the net crossings 
across all routing cell edges have been determined. This strategy utilizes the maximum number of two- 
by-two routing solutions. 

2.4.2. Minimal Boundary Determination 

Figure 6 shows the first steps in the hierarchical decomposition for this second strategy. The top-most 
two-by-two problem is solved (Step 1), followed by quick heuristic approximations of the crossings of 
nets. The four subproblems are then completely independent in Step 2. These steps are repeated recursively 
until the routing cell level (supercell = routing cell) is reached. This strategy utilizes the fewest 
two-by-two routing solutions for a hierarchical routing. 
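
The recursion can be sketched by counting the two-by-two solutions it generates. The sketch assumes a square grid of Z x Z routing cells with Z a power of two, so that every split produces four equal quadrants; the function name is hypothetical.

```python
def count_minimal_tasks(z):
    """Number of two-by-two solutions for a z-by-z grid, z a power of two."""
    if z < 1 or z & (z - 1):
        raise ValueError("z must be a positive power of two")
    if z == 1:
        return 0                 # supercell == routing cell: recursion stops
    # one solution at this level, then four independent quadrants
    return 1 + 4 * count_minimal_tasks(z // 2)
```

Under these assumptions the count equals (Z^2 - 1)/3, the value used for this strategy in Section 3.1.2.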

The strategy of the minimal determination of the boundary lines is by far the fastest since the number 
of nodes in the graph (or solutions of two-by-two routing instances) is much smaller than for the Maximal 
Boundary Determination strategy; however, computation speed is traded against the expected quality of the 
solution. The routing difficulty arises because, without a costly complete analysis, it is 
extremely hard to determine exactly where along the boundaries each net should cross. Some approximations 
based on the pin locations of each net are used to estimate the crossing, but if the boundaries are 
not well predicted, the quality of the routing will be severely degraded starting from the top-most two-by-two 
solution (Step 1). The Maximal strategy takes the extra effort to completely analyze the routing constraints 
in a hierarchical fashion. 

2.5. Task Complexity 

In the previous sections, we have discussed some of the basic elements of the two-by-two solution 
task. These are summarized as follows: 

1. Evaluate pin types. 

2. Set up linear programming formulation. 

3. Solve linear/integer program. 

4. Assign routing pattern to each net. 

5. Subdivide area for next level of hierarchy. 

6. Repeat with child nodes. 

THEOREM 1: The complexity of a solution of a task is O(n), where n is the number of nets. 

Proof: We will show that each subtask is O(n) in the worst case. A circuit is assumed to have p ≈ kn, 
where p is the number of pins or net terminals, k is a constant equal to the average number of pins per 
net, and n is the number of nets in the circuit. Thus p is O(n). 

1. To evaluate each pin type requires a search for pins in the current region. This operation is 
O(p) = O(n). 

2. Each net is assigned to a specific linear program variable based on the characteristics of the net's 
pins. This subtask is O(n). 

3. The simplex solution of a linear program (with 27 variables, a fixed number independent of the problem 
size) can be shown to terminate in a finite number of pivots (steps) provided proper pivoting 
techniques are used. We are also applying cutting plane methods to convert the linear program 
solution into an integer solution [24]. Measurements show that the average number of pivots 
in the simplex solution is less than 6. Since the size of the program is fixed, the cost of this 
subtask does not grow with n. 

4. The current implementation utilizes a very simple assignment algorithm which runs in O(n). 

5. Subdivision of the current two-by-two region and setup for the next level of the decomposition can be 
done in constant time. 

Thus, the complexity of a task solution is O(n). 

QED 

The total complexity of each strategy is the product of the task complexity and the total number of 
tasks (nodes), which is proportional to the number of routing cells. If M is the total number of routing cells 
and n is the number of nets, then the computational complexity of PHIGURE is O(nM). 

2.6. Experimental Results on Task Complexity 

In the following figures, the measurements were taken on the Encore Multimax executing the Maxi- 
mal Strategy on the Primary 1 benchmark. The iteration number refers to the task solution number in a 
depth-first trace of the execution graph. Figure 7 shows the time taken to set up the net types before the 
LP solution for each of the task solutions. The average time is 12.9 ms; the standard deviation is 1.2 ms. 
Figure 8 shows the time taken to solve the given LP problem for each task solution. The average time is 
5.7 ms; the standard deviation is 5.3 ms. Figure 9 shows the execution time to assign the net types to a 
specific configuration for each task solution. The average time is 1.0 ms; the standard deviation is 0.6 ms. 
Figure 10 shows the total execution time for each task solution. The average time is 19.6 ms; the stan- 
dard deviation is 5.6 ms. 


III. PARALLEL ALGORITHM OVERVIEW 

3.1. Exploitation of Coarse-Grained Parallelism 

Since the ratio of execution time to synchronization/communication time for the nodes of the execu- 
tion graph is very large, these tasks are considered to be coarse-grained. 

The parallel execution of a binary tree is a well-known paradigm, and as we discussed in previous 
sections, the hierarchical routing execution in PHIGURE takes the form of a binary tree in which the nodes 
of the tree represent the LP set-up, the LP solution, and the net assignments for a single two-by-two rout- 
ing problem. Furthermore, each node of the tree that is currently being evaluated is completely indepen- 
dent of all other active nodes. The local information for the current sub-problem is derived from its parent 
node’s data structures and global pin location information which is strictly read-only. The solution of the 
routing subproblem causes the executing process to write the results to a global (shared) output data 
structure. Since the tasks are spatially independent, there are no contention or race hazards as a process 
writes out its results. 

After writing the results, the process creates two child routing subproblems. One child subproblem is 
assigned to the first idle and waiting process. The second child subproblem is then executed by the 
parent itself. If no processes are waiting, the parent will proceed to execute the first subproblem, followed 
by the second. The number of processes created and initially available for subproblem solution is set 
equal to the number of processors available to the user. 
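
The hand-off rule described above can be sketched as follows. This is an illustration under stated assumptions rather than PHIGURE's implementation: Python threads stand in for the processes, a semaphore models the pool of idle workers, the task body is a placeholder, and the names `run_tree` and `solve` are hypothetical. Each task writes only to its own key of the shared result, mirroring the spatial independence of the subproblems.

```python
import threading


def run_tree(depth, num_processes):
    """Solve a binary task tree of the given depth; return {node id: level}."""
    results = {}                                   # shared output, one key per task
    idle = threading.Semaphore(num_processes - 1)  # workers waiting for a task

    def solve(node, level):
        results[node] = level          # placeholder for LP setup, solve, assign
        if level + 1 >= depth:
            return
        left, right = 2 * node + 1, 2 * node + 2
        if idle.acquire(blocking=False):
            # an idle worker exists: hand one child to it
            def work():
                try:
                    solve(left, level + 1)
                finally:
                    idle.release()     # the worker becomes idle again
            helper = threading.Thread(target=work)
            helper.start()
            solve(right, level + 1)    # the parent executes the other child
            helper.join()
        else:
            # nobody is idle: the parent runs both children in turn
            solve(left, level + 1)
            solve(right, level + 1)

    solve(0, 0)
    return results
```

With `num_processes=1` the semaphore starts at zero, so the whole tree is traversed depth-first by a single thread, which matches the single-process case.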

The routing solution complexity and speedup under parallel execution for both decomposition stra- 
tegies are estimated in the following sections. 

3.1.1. Maximal Boundary Determination 

Given R rows and C columns of routing cells, the required number of evaluations to solve the vertical 
segments of all routing cells in the maximal decomposition strategy is (R - 1) x (C - 1). Likewise, the 
required number of evaluations to solve the horizontal segments is (C - 1) x (R - 1). However, one vertical 
and one horizontal component is solved at each iteration, so the total number of evaluations $N_{2\times2}$ is 

\[ N_{2\times2} = (R-1)(C-1). \]

This expression has been verified through actual runs of PHIGURE. The estimated execution time for one 
process is then 

\[ T_1 = T_{2\times2}\,(R-1)(C-1), \]

where $T_{2\times2}$ is the average time to solve a single two-by-two routing problem as a linear function of the 
number of nets n. The estimated execution time $T_P$ for P processes is equal to the time spent executing 
until all P processes are activated plus the time spent in full parallel execution: 

\[ T_P = (T_{2\times2} + T_{sync}) \left[ 2\log_2 P + \log_4 P - 2 + \frac{(R-1)(C-1) - \frac{7P-13}{3}}{P} \right], \]

where $T_{sync}$ is an estimation of the time spent in synchronization. After simplifying the expression, we get 

\[ T_P = (T_{2\times2} + T_{sync}) \left( \frac{(R-1)(C-1)}{P} - \frac{13P-13}{3P} + \frac{5}{2}\log_2 P \right). \]

The expected speedup is then 

\[ S_P = \frac{T_1}{T_P} = \frac{T_{2\times2}}{T_{2\times2} + T_{sync}} \cdot \frac{6P(R-1)(C-1)}{6(R-1)(C-1) - 26P + 26 + 15P\log_2 P}. \]
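
The estimate for the maximal strategy can be evaluated numerically with a short sketch. It assumes the parallel time has the form (T_2x2 + T_sync) [2 log2 P + log4 P - 2 + ((R-1)(C-1) - (7P-13)/3)/P] and that the speedup is T_2x2/(T_2x2 + T_sync) times 6P(R-1)(C-1) / (6(R-1)(C-1) - 26P + 26 + 15P log2 P). The function names and the `sync_ratio` parameter (standing for T_sync / T_2x2) are introduced here for illustration only.

```python
import math


def maximal_parallel_steps(r, c, p):
    """Return T_P / (T_2x2 + T_sync): startup steps plus fully parallel steps."""
    n = (r - 1) * (c - 1)                      # total two-by-two solutions
    startup = 2 * math.log2(p) + math.log2(p) / 2 - 2   # 2 log2 P + log4 P - 2
    correction = (7 * p - 13) / 3              # term subtracted from n in the estimate
    return startup + (n - correction) / p


def maximal_speedup(r, c, p, sync_ratio=0.1):
    """Closed-form speedup estimate for p processes on an r-by-c grid."""
    n = (r - 1) * (c - 1)
    denom = 6 * n - 26 * p + 26 + 15 * p * math.log2(p)
    return 6 * p * n / denom / (1 + sync_ratio)
```

Dividing the task count by `maximal_parallel_steps` and comparing with `maximal_speedup(..., sync_ratio=0.0)` is a quick way to check that the unsimplified and simplified forms agree.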


3.1.2. Minimal Boundary Determination 

Again, given R rows and C columns of routing cells, with Z = min(R, C), the required number of node 
tasks to solve is 

\[ N_{2\times2} \ge \sum_{i=0}^{\log_2 Z - 1} 4^i = \frac{Z^2 - 1}{3}, \]

in which the equality holds for cases when $\log_2 Z$ is an integer. The estimated time for completion for one 
process is $N_{2\times2} \times T_{2\times2}$. Again, the estimated execution time for P processes is equal to the time spent 
executing until all P processes are activated plus the time spent in full parallel execution: 

\[ T_P = (T_{2\times2} + T_{sync}) \left[ \log_4 P + \frac{\frac{Z^2-1}{3} - \frac{P-1}{3}}{P} \right]. \]

After simplifying the expression, we get 

\[ T_P = (T_{2\times2} + T_{sync}) \left( \frac{Z^2 - P}{3P} + \frac{1}{2}\log_2 P \right). \]

The expected speedup is then 

\[ S_P = \frac{T_1}{T_P} = \frac{T_{2\times2}}{T_{2\times2} + T_{sync}} \cdot \frac{2P(Z^2-1)}{2Z^2 - 2P + 3P\log_2 P}. \]

Figure 11 provides a graphical look at the previous set of equations assuming that $T_{sync}/T_{2\times2} = 0.1$. 

Included in the plot is an estimate of process efficiency (useful time/ total time) and its effect on the possi- 
ble speedup. The current implementation provides dynamic task scheduling based on process availability. 
Thus, due to task granularity, there will be times when a process waits idle for a new task to be generated. 
As the number of processes increases, the process efficiency is expected to decrease. 
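
For the minimal strategy, the same kind of estimate can be sketched in a few lines. The sketch assumes a task count of (Z^2 - 1)/3 and a parallel time of (T_2x2 + T_sync) [log4 P + ((Z^2 - 1)/3 - (P - 1)/3)/P]; the function name and the `sync_ratio` parameter (standing for T_sync / T_2x2) are assumptions of the example, and the process-efficiency effect discussed above is not modeled.

```python
import math


def minimal_speedup(z, p, sync_ratio=0.1):
    """Estimated speedup of the minimal strategy on a z-by-z grid with p processes."""
    n = (z * z - 1) / 3                         # two-by-two solutions in the tree
    startup = math.log2(p) / 2                  # log4 P levels until all are busy
    steps = startup + (n - (p - 1) / 3) / p     # T_P / (T_2x2 + T_sync)
    return n / ((1 + sync_ratio) * steps)
```

Evaluating the function over a range of P for fixed Z gives curves of the kind the speedup plot describes, with the logarithmic startup term limiting the gain as P grows.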

3.2. Exploitation of Fine-Grained Parallelism 

There are three specific subtasks which can be executed in parallel at a fine-grained level. First, 
during the LP setup, the type for each net of the current two-by-two problem is determined. Since each 
net is independent, the nets may be divided between available processes and evaluated in parallel. 
Second, the exchange operations required to solve the linear/integer program may also be divided 
between available processes for parallel execution. Finally, the assignment of nets could also be done in 
parallel, based on specific net types. These three areas of parallelism are orthogonal to each other. 

However, since the amount of parallelism available at the task level is so great, the exploitation of 
parallelism at the fine-grain level would not provide significant improvement. Only during the startup 
phase of the execution tree will there be specific processes idle. Figure 12 shows the percentage of the 
number of two-by-two solutions in the startup phase to the total number of two-by-two solutions for routing 
problems with R = C = Z and P = 16. As is clear from the figure, the part of the execution in large sized 
problems for which fine-grain parallelism can be useful is extremely small. Furthermore, parallelism of the 
simplex solution would not be effective since the average number of pivoting operations for solution is less 
than 6. Therefore, we determined that it was unnecessary to implement these tasks at such a fine-grain 
level. 


IV. IMPLEMENTATION AND RESULTS 

PHIGURE was implemented using approximately 5000 lines of C code on an eight processor Encore 
Multimax 510 (shared memory multiprocessor), in which each processor is a National 50310 CPU. Exper- 
iments were performed on a few of the placement and routing benchmarks from the MCNC Workshop on 
Placement and Routing, along with a number of other circuits. Testing was done for a single process, two 
processes, four processes, and eight processes. 

Table 1 compares the routing results of PHIGURE to actual runs of the TimberWolf 5.4 global router 
(TW) [11], and some of the recently published results for the UTMC router (UT) [11], a router by Cong and 
Preas (CP) [12], and LocusRoute (LR) [23]. This table shows that PHIGURE performs well within the range 
of some recently published routers. Table 2 compares the uniprocess runtimes for the TimberWolf 5.4 
router with those of PHIGURE. These measurements were taken on a Sun 3/110 workstation. 

Table 3 shows the results for two of the Placement and Routing Workshop benchmark circuits and 
three other standard cell circuits. For each circuit, the table gives the number of tracks used as estimated 
by the maximum channel density and the average execution times in seconds (real time, including pro- 
cess creation) for one, two, four, and eight processes using the Minimal and Maximal decomposition stra- 
tegies. Cell placements for all of the circuits were performed by TimberWolf 5.4. As is clear from the 
table, there is no degradation in routing quality when going from a single process to many processes, and 
very good speedups were achieved (>6 for 8 processes). Since the hierarchical decomposition creates a 
large number of jobs after the first few steps, our algorithm is scalable to large numbers of processes. 
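
The speedup figures can be recomputed from the reported execution times. The sketch below uses one series of times from Table 3 (187, 97, 52, and 30 seconds for 1, 2, 4, and 8 processes); the helper names are hypothetical, and the parallel efficiency computed here (speedup divided by process count) is an added convenience measure, not a quantity reported in the table.

```python
def speedups(times):
    """Speedup of each run relative to the first (single-process) time."""
    base = times[0]
    return [round(base / t, 1) for t in times]


def efficiencies(times, processes):
    """Speedup divided by process count, as a fraction of the ideal."""
    base = times[0]
    return [base / (t * p) for t, p in zip(times, processes)]


times = [187, 97, 52, 30]            # seconds, one series from Table 3
print(speedups(times))
print(efficiencies(times, [1, 2, 4, 8]))
```

For this series the eight-process run is a little over six times faster than the single-process run, which corresponds to a parallel efficiency of roughly 78 percent.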

V. CONCLUSIONS 

In this paper we have presented a new parallel global router, PHIGURE, which applies hierarchical 
routing and decomposition techniques to create independent subproblems which can be evaluated in 
parallel. Results were presented which compare two strategies for decomposing the routing problem and 
show that high quality routings are attainable for one strategy. Most importantly, the routing quality is not 
degraded by decomposing in parallel. We are currently investigating improved task scheduling schemes, 
improved net assignment techniques, incorporating fine grain parallelism, implementation on a message- 
passing multiprocessor, extensions to a combined place and route algorithm, and strategies for adaptive 
rerouting of problem nets. 

Table 3. Parallel Algorithm Results.

                 Min Decomp                    Max Decomp
Procs    Trks   Time(sec)   SpdUp      Trks   Time(sec)   SpdUp

  1       348       33       1.0        177       66       1.0
  2       348       17       1.9        177       34       1.9
  4       348        9       3.7        177       21       3.1
  8       348        6       5.5        177       14       4.7

  1       817      187       1.0        404      257       1.0
  2       817       97       1.9        404      131       1.9
  4       817       52       3.6        404       74       3.5
  8       817       30       6.2        404       42       6.1

  1       641      189       1.0        416      202       1.0
  2       641       92       2.0        416      101       2.0
  4       641       47       4.0        416       59       3.4

Circuit X3 (3258)